Abstract

The Bouma2 algorithm attempts to challenge the prevalent "stateful" exact
string-match paradigms by suggesting a "quasi-stateless" approach. We
claim that using state-machines to solve the multiple exact string-match
problem introduces a hidden artificial constraint, namely the Consume-Order
Dependency, which results in unnecessary overhead. Bouma2 is not
restricted in this sense; we postulate that this allows memory-efficiency and
improved performance versus its state-machine equivalents. The heart of the
Bouma2 preprocessing problem is formulated as a weighted Integer Linear
Programming problem, which can be tuned for memory footprint and performance
optimization. Specifically, this allows Bouma2 to be input-sensitive,
as tuning can be based on input characteristics. Evaluating Bouma2 against
the Aho-Corasick variant of the popular Snort Intrusion Prevention System,
we demonstrate double the throughput while using about 10% of the memory.

Keywords: pattern match, hash function, integer linear programming, 
motif, deep packet inspection, Snort 



1. Introduction 

Multiple exact string-match is a classical problem with a vast range of
applications and solutions. This paper describes the Bouma2 algorithm, which
takes a somewhat unorthodox approach to this problem. The most significant
difference is that Bouma2 is Consume-Order Agnostic: the majority of
existing algorithms restrict their match procedure to consume symbols in
a predefined order (usually left-to-right or right-to-left). This restriction is
by no means implied by the definition of the problem (see Definition 1).
Indeed, existing algorithms would require certain data-structures to be built
for left-to-right searches, but these data-structures would usually be useless
for right-to-left searches over the same input. With Bouma2, the same
data-structures can be used for matching in any sequence. We believe that this
difference is what makes Bouma2 so efficient in memory: many redundant
scenarios in the "event-horizon" of the match-procedure do not require
consideration in the Bouma2 data structures.

In order to explain Bouma2 more easily, we try to intuitively relate it with 
some cognitive models that describe human word recognition. The following 
sentence can be easily understood by a person with basic English reading 
skills: 

"If you can raed tihs, 
tehn you are prbbolay not a sttae-mhciane. " 

The spelling mistakes in the text are also immediately noticeable. It is 
claimed that the brain first identifies the contour, or Bouma Shape [1, 2, 3], 
of each word, already associating it with its possible meaning, and only then 
the spelling mistakes are considered. Conversely, a state-machine capable of 
recognizing every correctly-spelled word through a single left-to-right pass 
over the text would simply yield a no-match result. An alternative exact-match
algorithm that maintains "hints" for every word (e.g. for the above
sentence, the first and last character in every word) would still return a
no-match result for misspelled words, but would also be capable of reporting
"false-positives" (which may roughly be associated with spelling mistakes).
This would have to be at the expense of traversing parts of the text more than 
once, contrary to the one-pass behavior that is typical of state-machines. 

Bouma2 follows similar concepts: it searches first for Motifs¹, which
are 2-symbol substrings of any one of the sought strings. For every motif 
occurrence in the input string, symbols around the motif location are exam- 
ined to corroborate the match. Bouma2 maintains a mapping between the 
original pattern strings and their corresponding motifs. For every pattern 
string there are 2 mappings: one mapping to a motif at an even offset from 
the beginning of the pattern string, and one mapping to an odd-offset motif. 



1 The term Motif was adopted from Computational Biology [4], where Sequence Motifs
are used in DNA sequence analysis schemes. 
Obviously, many pattern strings may be mapped to the same motif. This
would require a resolving process when such a motif is located in the input. 
For efficient resolving we have developed a new data-structure termed the 
Mangled Trie (see Section 4.3). 

This double mapping scheme allows the match procedure to advance in
2-symbol strides, and still find matches at any given offset. This, and the
fact that every motif match is handled separately, together allow Bouma2
to be Consume-Order Agnostic: as long as all the input is consumed,
and all the motif matches are accounted for, the exact order in which the
input traversal is performed is of no importance. Indeed, the match
process could be completely parallelized by accessing all non-overlapping
2-symbol sequences at once; this hints at a very efficient hardware
implementation of this algorithm.

The mapping of the set of strings to their corresponding motifs can be 
viewed as a hash-function. For large numbers of strings, there may be millions 
of different valid motif-sets and hash-functions. We optimize the selection of
the motif-set by formulating the problem as an Integer Linear
Programming problem and solving it using the Branch-and-Cut algorithm.
Furthermore, weights can be applied to every potential motif in order
to optimize memory, performance, etc.

The remainder of this paper is organized as follows: Section 2 surveys 
related work, with emphasis on consume-order dependency; Section 3 defines 
basic concepts and notations; Section 4 presents the 3 parts of the Bouma2 
preprocessing stage; Section 5 describes the match process and explains its 
separation to Fast-Path and Slow-Path; Section 6 documents benchmarks 
done against the Snort® IPS software; Section 7 provides conclusions and 
Section 8 describes future work. In Appendix A we provide the Mangled- 
Trie construction heuristics and in Appendix B the Fast-Path and Slow-Path 
match-time procedures. 

2. Related Work 

Multiple exact string-match is a classical problem with applications in 
many fields, ranging from Information Security [5, 6, 7, 8] through Inter- 
net Services [9, 10], Bioinformatics [11, 12] and others. The diversity of 
the different solutions to this problem is impressive. In this paper we
informally describe the notion of Consume-Order Dependency, which we
claim is characteristic of the majority of existing solutions to this problem. A
Consume-Order Dependent solution would typically include a data-structure
preprocessed from the pattern strings and a match-time algorithm that would
traverse this structure along with the input string, but would also implicitly
dictate the order of traversal over the input string. A Consume-Order Agnostic
solution, on the other hand, would be free of the traversal order constraint,
and there would exist several equally efficient algorithms, which would be able
to utilize the same preprocessed data-structure in different traversal orders,
and still arrive at the same correct match results. Note that in this paper
we do not give a formal definition of Consume-Order Dependency, nor do
we prove the superiority of Consume-Order Agnostic solutions over other
solutions; the notion is used primarily for classification. Nevertheless, we
intuitively suggest that, in general, solutions imposing a constraint that is
immaterial to the original problem they aim to solve may be less
efficient than solutions that are free of this constraint.

The de-facto industry standard for multiple exact string match is the Aho- 
Corasick algorithm [13]. This, and its family of derived algorithms [14, 15], 
variants and optimizations [10, 16], are all inherently Consume-Order De- 
pendent: they assume traversal of the input in atomic steps following a well- 
defined order, usually with no option of re-examining an already traversed 
symbol. 

The Wu-Manber algorithm [17] extends the Boyer-Moore [18] single- 
pattern match algorithm to multiple patterns. It belongs to a separate 
family of skip-table algorithms, which preprocess one or more tables for in- 
dicating how many symbols can be skipped based on previously consumed 
symbols. This family also has many variations, improvements and applica- 
tions [19, 20, 21, 22]. Of course, all of them again exhibit Consume-Order 
Dependency: the skip-table is calculated assuming that skips are always in 
the same direction. A different skip-table would have to be built if consume- 
order is reversed. 

String-hashing algorithms like Rabin-Karp [23] and Muth-Manber [24] 
treat strings as keys to a hash-function, and use the hash-value to find a 
subset of possibly matching strings, all sharing the same hash-value. Here, 
Consume-Order Dependency is a necessity if rolling-hash functions are chosen
(which is usually the case, for performance reasons). Although the input
can be consumed in any order, and the hash-function can be calculated
at any point in the input, only a left-to-right order or a right-to-left order
allows a more efficient rolling-hash calculation. Nevertheless, Bouma2 has
considerable affinity to this family of algorithms, as it also maps strings to
"hash-values" and resolves collisions at match-time. The differences are:

1. The hash-function is tailor-made using optimization techniques

2. Hash-values are always 2-symbol substrings of the key string

3. Every key string is mapped twice to hash-values

4. Collision-resolving takes into account the already matched 2-symbol 
substring for relative offset information 

Bouma2 is also affiliated with filtering algorithms like Q-Grams [25] and
Bloom-Filters [26, 27, 28], which may exhibit Consume-Order Independence
during the filtering phase. Traversing the input in non-overlapping 2-grams is
very close to the Bouma2 Fast-Path procedure (see Section 5). Nevertheless,
we have not seen any documented post-filtering phase that is Consume-Order
Agnostic.

It is also important to note that our approach attempts to be input- 
sensitive at the expense of losing generality. The Bloom-Filter approach, for 
example, assumes that choosing a uniformly-distributed hash and reducing 
false-positives for the random case (see [29]) would improve performance. 
Our claim is that optimizations that are aware of input characteristics will 
be more effective in real-life situations. 

3. Basic Definitions 

An alphabet $\Sigma$ is a nonempty set of symbols. A word over $\Sigma$ is a finite
sequence of symbols of $\Sigma$. The empty word is denoted by $\varepsilon$ and the length
of a word $w$ is denoted by $|w|$. The set of words over $\Sigma$ is denoted by $\Sigma^*$.
A language $L$ is any subset of $\Sigma^*$. The set of all 2-symbol words is denoted
by $\Sigma^2$. The total length of all words in a language ($\sum_{w \in L} |w|$) is denoted
by $sz(L)$ (refer to [30] for more details).

Definition 1 (The Multiple Exact String Match Problem). Given a
language $L \subseteq \Sigma^*$ and a long word $w_I \in \Sigma^*$, find all occurrences of any of
the words in $L$ that are substrings of $w_I$.

3.1. Traces and Motifs 

Given a language, Bouma2 first identifies the complete set of all 2-symbol 
substrings of all the words (termed Traces). Then, every word is mapped 
to two of its own 2-symbol substrings, or Motifs. There are 2 mappings per 
word, one at an even offset and one at an odd offset. The following describes 
the relationships between words, traces and motifs. 
Definition 2 (Trace-Set). For a given language $L$, the set $T_L \subseteq \Sigma^2$ that
satisfies:

$$t \in T_L \iff \exists w \in L \text{ and } \exists w_p, w_s \in \Sigma^* : w = w_p\, t\, w_s . \tag{1}$$

is named the Trace-Set of $L$. Any $t \in T_L$ is named a Trace.

Definition 3 (The Trace Occurrence Function). The function $occ$ that
satisfies:

$$occ : L \times \Sigma^2 \times \mathbb{N} \to \{0, 1\} ,$$

$$occ(w, t, l) = 1 \iff w = w_p\, t\, w_s \wedge |w_p| = l . \tag{2}$$

is named the Trace Occurrence Function of $L$.

Definition 4 (The Trace Association Functions). The following functions,
$assoc_0$ and $assoc_1$, are respectively named the Even and Odd Trace
Association Functions, and are defined as:

$$assoc_0, assoc_1 : L \times \Sigma^2 \to \{0, 1\} ,$$

$$assoc_0(w, t) = 1 \iff \sum_{l=0}^{\lfloor |w|/2 \rfloor} occ(w, t, 2l) > 0 . \tag{3}$$

$$assoc_1(w, t) = 1 \iff \sum_{l=0}^{\lfloor (|w|-1)/2 \rfloor} occ(w, t, 2l + 1) > 0 . \tag{4}$$
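Definitions 2 to 4 translate almost directly into code. The following is a minimal Python sketch, not the paper's implementation; the function names `occ`, `assoc` and `trace_set` simply mirror the notation above, and the word list is the six-word language of Example 1, used here only for illustration.

```python
def occ(w: str, t: str, l: int) -> int:
    """Trace occurrence (Definition 3): 1 if the 2-symbol trace t starts at index l of w."""
    return int(len(t) == 2 and l >= 0 and w[l:l + 2] == t)

def assoc(w: str, t: str, parity: int) -> int:
    """Even (parity=0) or odd (parity=1) trace association (Definition 4)."""
    return int(any(occ(w, t, l) for l in range(parity, len(w) - 1, 2)))

def trace_set(language):
    """Trace-set (Definition 2): every 2-symbol substring of every word."""
    return {w[l:l + 2] for w in language for l in range(len(w) - 1)}

# The language of Example 1 ("ferrarri" is intentionally misspelled there).
L = ["herd", "herbal", "upper", "deeper", "error", "ferrarri"]
T = trace_set(L)

print(len(T))                    # 19 distinct traces
print(assoc("deeper", "er", 0))  # 1: "er" starts at even offset 4
print(assoc("deeper", "er", 1))  # 0: "er" never starts at an odd offset
print(assoc("ferrarri", "rr", 0), assoc("ferrarri", "rr", 1))  # 1 1
```

The last line shows a word in which the same trace is available at both an even and an odd offset.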

Definition 5 (Motif-Set). Given a trace-set $T_L$, any $M_L \subseteq T_L$ that satisfies
for every $w \in L$:

$$\sum_{t \in M_L} assoc_0(w, t) \geq 1 \;\wedge\; \sum_{t \in M_L} assoc_1(w, t) \geq 1 . \tag{5}$$

is named a Motif-Set of $L$. A Motif is every trace $\mu \in M_L$.

Proposition 1 (Motif-Set Existence). There exists a motif-set for every
$L$ satisfying:

$$\forall w \in L : |w| > 2 . \tag{6}$$
Proof. By example: consider the trace-set $M_L$, in which the condition in
Definition 5 is inherently satisfied:

$$M_L := \bigcup_{w \in L} \{t : occ(w, t, 0) = 1\} \;\cup\; \bigcup_{w \in L} \{t : occ(w, t, 1) = 1\} . \tag{7}$$

□
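The construction used in the proof can be checked mechanically. The sketch below is illustrative only: `parity_traces`, `is_motif_set` and `trivial_motif_set` are hypothetical helper names, and the test language is the one from Example 1. It builds the set of equation (7) and verifies the condition of Definition 5.

```python
def parity_traces(w, parity):
    """All traces of w that start at offsets of the given parity (0 = even, 1 = odd)."""
    return {w[l:l + 2] for l in range(parity, len(w) - 1, 2)}

def is_motif_set(M, language):
    """Definition 5: every word needs at least one even-offset and one odd-offset motif in M."""
    return all(M & parity_traces(w, 0) and M & parity_traces(w, 1) for w in language)

def trivial_motif_set(language):
    """Equation (7): the traces at offsets 0 and 1 of every word (requires |w| > 2)."""
    return {w[0:2] for w in language} | {w[1:3] for w in language}

L = ["herd", "herbal", "upper", "deeper", "error", "ferrarri"]
M = trivial_motif_set(L)

print(sorted(M))                                   # ['de', 'ee', 'er', 'fe', 'he', 'pp', 'rr', 'up']
print(is_motif_set(M, L))                          # True
print(is_motif_set({"he", "pe", "rr", "er"}, L))   # True
print(is_motif_set({"he", "pe", "er"}, L))         # False: "error" has no odd-offset motif
```

The trivial set is valid but has eight motifs, which is why the selection step of Section 4.1 is worth optimizing.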



Example 1 (Motif-Sets). Consider the following 6-word language of 8-bit
characters:

$$L = \{\text{herd}, \text{herbal}, \text{upper}, \text{deeper}, \text{error}, \text{ferrarri}^{2}\} . \tag{8}$$

Table 1 shows the trace-set for this language. It also shows 3 motif-sets: the
first one attempts to map all the words using as few traces as possible, the
second tries to avoid traces that are frequent in the word-set, and the third
tries to avoid traces that are expected to be frequent in the input (using trace
occurrence statistics collected in advance). See Section 4.1 for an explanation
of the motif-selection process.

3.2. Resolve-Sets

The process of deducing the correct word from a motif occurrence can be
compared to collision resolving of a hash-function. There may be many words
mapped to the same motif. Upon encountering a motif, the match procedure
must select the correct match (if there is a match) out of the specific word-set
mapped to the motif. The following defines the hash-function and the set of
words mapped to a single motif.

Definition 6 (Motif-Set Hash Function). Any surjective function $H_{M_L}$
such that:

$$H_{M_L} : L \times \{0, 1\} \to M_L ,$$

$$H_{M_L}(w, i) = \mu \implies assoc_i(w, \mu) = 1 . \tag{9}$$

is named a Motif-Set Hash Function of $L$.

Definition 7 (Resolve-Set). For a given motif $\mu \in M_L$ and hash-function
$H_{M_L}$, $\mu$'s Resolve-Set $R_\mu \subseteq L$ is:

$$R_\mu := \{w : H_{M_L}(w, 0) = \mu \vee H_{M_L}(w, 1) = \mu\} . \tag{10}$$

2 Intentional misspelling.
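Definitions 6 and 7 can be illustrated with a small sketch. The encoding below is an assumption made for this example: the hash function is stored as a dictionary `H` from each word to the offsets of its even and odd motifs, using the minimum-motifs selection of Table 2; `resolve_sets` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical encoding of H_{M_L}: word -> (even motif offset, odd motif offset),
# following the minimum-motifs selection {he, pe, rr, er}.
H = {
    "herd": (0, 1), "herbal": (0, 1), "upper": (2, 3),
    "deeper": (4, 3), "error": (0, 1), "ferrarri": (2, 5),
}

def resolve_sets(H):
    """Group (word, motif offset) pairs by the motif they are hashed to (Definition 7)."""
    R = defaultdict(list)
    for w, (even, odd) in H.items():
        assert even % 2 == 0 and odd % 2 == 1, "one even and one odd mapping per word"
        for l in (even, odd):
            R[w[l:l + 2]].append((w, l))
    return dict(R)

R = resolve_sets(H)
for motif in sorted(R):
    print(motif, R[motif])
```

Note that `ferrarri` appears twice in the resolve-set of `rr`, once per mapping, while the resolve-set of `er` holds the five remaining words.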
Trace-Set   Motif-Set           Motif-Set               Motif-Set
            (minimum motifs)    (prefer rare motifs     (prefer rare motifs
                                in strings)             in input)
=========   =================   ====================    ====================
fe          -                   fe                      -
ee          -                   ee                      -
ep          -                   -                       -
ra          -                   -                       -
rb          -                   rb                      rb
rd          -                   rd                      rd
ri          -                   -                       -
ro          -                   -                       -
al          -                   -                       -
ar          -                   -                       -
ba          -                   -                       -
up          -                   up                      -
pp          -                   -                       -
de          -                   -                       -
or          -                   or                      -
he          he                  -                       -
pe          pe                  -                       pe
rr          rr                  -                       rr
er          er                  er                      er

Table 1: Motif-Sets: The leftmost column shows the complete trace-set for Example 1,
while the other columns specify 3 different selections of motifs. The first selection attempts
to form the smallest possible motif-set. The second selection tries to find motifs that appear
as few times as possible within the pattern strings. The last selection tries to find motifs
that are expected to appear as few times as possible within the expected input string (based
on statistics collected in advance).
Example 2 (Resolve-Sets). Resolve-sets for each one of the motifs in
Example 1 are shown in Table 2. Note that for a given motif-set, there may be
many valid resolve-sets: for the minimum motifs resolve-set in this example,
instead of having an odd-motif mapping of ferrarri to rr, we can map this
word to er without changing the motif selection. See Section 4.2 for details.

4. The Bouma2 Preprocessing Stage 

As stated, Bouma2 essentially implements a hashing scheme for mapping 
the words in a language onto a motif-set. There are several approaches to 
constructing this mapping. The approach that is described in this paper 
relies on linear optimization, and consists of 3 steps: 

1. Select an optimal motif-set out of the complete trace-set. 

2. Remove duplicate mappings. 

3. Build a Mangled-Trie for each motif.

4.1. Optimizing Motif Selection

The problem of selecting the best motif-set from a given trace-set can
be formulated as a weighted Integer Linear Programming [31] problem³.
We use cost functions for motifs in order to optimize the selection for spe- 
cific needs. Different cost functions serve different purposes, like improving 
performance, reducing memory footprint, or speeding up preprocessing time. 
Table 1 shows 3 different motif selections over the same trace-set. 

Definition 8 (The Bouma2 ILP Formulation). Given a Motif Cost
Function $c : \Sigma^2 \to \mathbb{R}$, we define $M_{L,c} \subseteq T_L$ as a Minimizing Motif
Set solving the following Integer Linear Programming problem:

$$\text{Minimize } \sum_{t \in T_L} c(t) \cdot x_t \ :\ x_t \in \{0, 1\}\ \ \forall t \in T_L$$

$$\text{subject to: } \forall w \in L : \sum_{t \in T_L} x_t \cdot assoc_0(w, t) \geq 1 \;\wedge\; \sum_{t \in T_L} x_t \cdot assoc_1(w, t) \geq 1 . \tag{11}$$



3 In a previous paper [32] we described how this problem is analogous to a Weighted 
Clique-Partitioning problem. 
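For a language as small as the one in Example 1, the ILP of Definition 8 can be solved by plain enumeration. The sketch below is not the Branch-and-Cut solver used by Bouma2; it swaps in an exhaustive smallest-first search, and it assumes the unit cost function $c(t) = 1$. The helper names are hypothetical.

```python
from itertools import combinations

def parity_traces(w, parity):
    """All traces of w that start at offsets of the given parity (0 = even, 1 = odd)."""
    return {w[l:l + 2] for l in range(parity, len(w) - 1, 2)}

def min_motif_set(language):
    """Smallest motif-set under the unit cost c(t) = 1, found by exhaustive search.

    Each x_t in (11) corresponds to "trace t is in the chosen subset"; the two
    constraints per word become "the subset intersects the word's even traces
    and its odd traces".
    """
    traces = sorted({w[l:l + 2] for w in language for l in range(len(w) - 1)})
    needs = [parity_traces(w, p) for w in language for p in (0, 1)]
    for k in range(1, len(traces) + 1):          # try smaller sets first
        for cand in combinations(traces, k):
            chosen = set(cand)
            if all(chosen & need for need in needs):
                return chosen
    return None                                   # only if some word is shorter than 3 symbols

L = ["herd", "herbal", "upper", "deeper", "error", "ferrarri"]
print(sorted(min_motif_set(L)))   # ['er', 'he', 'pe', 'rr']
```

The result agrees with the minimum-motifs column of Table 1. Enumeration grows exponentially with the number of traces, so it only serves to make the constraints concrete; realistic pattern sets need a proper ILP solver.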
Resolve-Set  Resolve-Set Contents       Resolve-Set Contents       Resolve-Set Contents
Name         (minimum motifs)           (prefer rare motifs        (prefer rare motifs
                                        in strings)                in input)
===========  =========================  =========================  =========================
R_fe         -                          { [fe]rrarri }             -
R_ee         -                          { d[ee]per }               -
R_rb         -                          { he[rb]al }               { he[rb]al }
R_rd         -                          { he[rd] }                 { he[rd] }
R_up         -                          { [up]per }                -
R_or         -                          { err[or] }                -
R_he         { [he]rd, [he]rbal }       -                          -
R_pe         { up[pe]r, dee[pe]r }      -                          { up[pe]r, dee[pe]r }
R_rr         { fe[rr]arri, e[rr]or,     -                          { fe[rr]arri, e[rr]or,
               ferra[rr]i }                                          ferra[rr]i }
R_er         { h[er]d, h[er]bal,        { h[er]d, h[er]bal,        { h[er]d, h[er]bal,
               upp[er], deep[er],         upp[er], deep[er],         upp[er], deep[er],
               [er]ror }                  [er]ror, f[er]rarri }      [er]ror }

Table 2: Resolve-Sets: The leftmost column lists all the motifs that belong to any of the
3 sets in Table 1. The other 3 columns show, for each motif-selection scheme, the
resolve-sets per motif. Square brackets mark the motif position in each word. Note that selecting
the motif-set does not dictate the resolve-sets per motif; for example, in the
Minimum-Motifs resolve-sets, ferrarri belongs to the rr resolve-set, but could also belong to the
er resolve-set (these two options are interchangeable since both motifs appear at an odd
offset within ferrarri).
Example 3 (Minimizing Motif False-Positives). Cost functions can facilitate
the use of statistics gathered on the input string and language. For
example, the conditional probability $P(w \mid t)$, i.e. the probability of the
word $w$ appearing in the input string, given that the trace $t$ was observed,
can be used as a weight (note that here we need to maximize the conditional
probability, so we negate the cost function):

$$c(t) = -\sum_{w \in L} (assoc_0(w, t) \vee assoc_1(w, t)) \cdot P(w \mid t) . \tag{12}$$

Example 4 (Memory Cost Function). Subject to proper implementation, 
minimizing the number of motifs may help reduce overall memory require- 
ments: 

c(t) = 1 . (13) 

4.2. Removing Duplicate Mappings

The motif-selection process may not provide a unique mapping of words
to motifs: for a given motif-set, the same word may be mapped to more
than 2 motifs in the set. We need to make sure that each word has only two
mappings⁴, otherwise the match-time algorithm would generate duplicate
reports for the same match. Redundant mappings may be to distinct motifs
or to the same motif. For example, the string http://www.wwwdotcom.com
may be mapped 4 times to ww: twice to an even offset and twice to an odd
offset. Also observe Table 2: for the motif selection with minimum motifs, the
word ferrarri could be mapped either to rr or to er; both are legitimate
odd motif mappings (although note the difference in the depth of the resulting
mangled-tries; see Section 5). When selecting which mapping to remove, we
consider symbol occurrence statistics, complexity of the resulting
mangled-tries, relative offset of motifs, etc.
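The duplicate-mapping example can be reproduced with a few lines of Python. This is only a sketch: `motif_offsets` is a hypothetical helper, and keeping the first occurrence of each parity is an arbitrary stand-in for the selection criteria listed above.

```python
def motif_offsets(w, motif):
    """Offsets at which a 2-symbol motif occurs in w, split into even and odd offsets."""
    even = [l for l in range(0, len(w) - 1, 2) if w[l:l + 2] == motif]
    odd = [l for l in range(1, len(w) - 1, 2) if w[l:l + 2] == motif]
    return even, odd

even, odd = motif_offsets("http://www.wwwdotcom.com", "ww")
print(even, odd)   # [8, 12] [7, 11]: four candidate mappings to the same motif

# Keep exactly one even and one odd mapping. Taking the first of each is an
# arbitrary placeholder rule, not the criterion used by Bouma2.
keep = (even[0], odd[0])
print(keep)        # (8, 7)
```

Without this pruning, a single occurrence of the string would be reported once per surviving mapping of the matching parity.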

4.3. Mangled-Trie Construction

When several pattern strings map to the same motif, upon finding this 
motif the match procedure has to decide how many of them (zero or more) 
actually match, and also report the exact match position for every matching
string. For this purpose we have developed the Mangled-Trie data-structure.
The mangled-trie is a special decision-tree, which dictates the



4 This rule has exceptions: e.g. for case-insensitive text matches (see 8.1), Bouma2 can 
accept up to 8 mappings per word. 
next position to examine in the input, while proceeding along any one of
its branches. This is opposed to a regular trie [33], which proceeds in a
predefined consume-order (evidently through the use of a Consume-Order
Dependent algorithm). Essentially, the mangled-trie is built over an extended
set of symbols that includes the relative offset dimension. This allows us to
predefine the sequence of examined symbols as offsets relative to the motif
match.

Definition 9 (The Mangled-Trie Symbol Set). For a given language $L$,
motif $\mu \in M_L$ and resolve-set $R_\mu \subseteq L$, the corresponding Mangled-Trie
Symbol Set is:

$$\Sigma_{MT_{L,\mu}} = \{(w, i) : w \in R_\mu,\ -l_{w,\mu} \leq i < |w| - l_{w,\mu},\ i \neq 0,\ i \neq 1\}$$

$$\text{where } l_{w,\mu} \text{ satisfies: } occ(w, \mu, l_{w,\mu}) = 1 . \tag{14}$$

The different types of mangled-trie nodes can be distinguished according 
to the following terminology: 

1. State: This node is associated with a single offset relative to the mo- 
tif position. It maps transitions to child nodes along the mangled- 
trie according to the symbol value at that offset. Any path along the 
mangled-trie may include no more than one state per offset. 

2. Transitional: This node is a special type of state: it indicates a match 
of a single string that resulted from consuming the symbol during the 
last transition. No further matches are required for this specific string, 
but if it is a substring of any other string in the mangled-trie that was 
not ruled out yet, the match proceeds. 

3. Terminal: This node holds information about remaining fragments 
of a single string, whereas all the other strings were already ruled out 
while traversing the mangled-trie. If all the fragments match, this node 
produces a successful match result for its associated string. 

4. Pivot: This node indicates a "fallback" state, that has to be visited 
after the current path is consumed. This occurs when there are two 
disjoint sets of string fragments - fragments that reside to the left of the 
motif position, and fragments that reside to the right, with no single 
string having fragments on both sides. Any of the above node types 
may also be a pivot. Also, any of the above node types may proceed 
to a pivot, but there cannot be more than one pivot along a single 
mangled-trie path. 
The heuristic described in Appendix A constructs a mangled-trie from a
given resolve-set. It recursively selects a "scoring" offset, and examines the 
possible 'interesting' symbol values at that offset. Every unique symbol in 
the scoring offset is considered separately, given that if this symbol would 
be found in the input at that offset, it will allow us to discard all the words 
that do not match this symbol. A subtrie is constructed for the remaining 
symbols belonging to the remaining words. 

Theorem 1 (Maximum Mangled-Trie Depth). Let $w_{max}$ be the longest
word in $L$ ($|w_{max}| \geq |w|, \forall w \in L$). Then the depth of any mangled-trie over
$L$ satisfies:

$$depth(MT_{L,\mu}) \leq 2 \cdot (|w_{max}| - 2) . \tag{15}$$

Proof. Every offset that is examined against a mangled-trie state allows us 
to eliminate all the strings that do not match its actual value. Therefore, if 
we have more offsets with overlapping strings, we can eliminate strings in 
fewer steps and our mangled-trie will be shallower. The worst case is when 
we have no overlapping offsets at all (except for the motif position itself, at
offsets 0 and 1):

$$\forall w \in R_\mu : \big(occ(w, \mu, |w| - 2) = 1 \wedge (w, 2) \notin \Sigma_{MT_{L,\mu}}\big) \vee \big(occ(w, \mu, 0) = 1 \wedge (w, -1) \notin \Sigma_{MT_{L,\mu}}\big) . \tag{16}$$

Obviously, in this case we would have to examine the input both before the
motif match and after it, no more than $|w_{max}| - 2$ symbols in each direction.

□ 

Example 5 (Mangled-Trie Construction). The following example describes
the recursive construction of a mangled-trie out of a given resolve-set.
Assume that the "scoring" offset is given to us at every iteration⁵. Consider $R_{er}$
for the minimum-motifs option in Table 2 (note that we bracket the
symbols at the motif position, to emphasize that they will not be considered
when building the mangled-trie):

{ h[er]d ,
  h[er]bal ,
  upp[er] ,
  deep[er] ,
  [er]ror }

Assume that the first scoring offset is -1. At this offset, there are only 3
'interesting' possibilities of symbol values: h, p or any other symbol. We
examine each possibility separately: if h is found at offset -1, then all strings
requiring p at this offset (i.e. upper and deeper) can be 'purged'. If p is
found, then all strings requiring h (i.e. herd and herbal) are purged. If any
other symbol is found, we can purge all strings that require a specific symbol
at offset -1, i.e. strings containing either h or p (specifically, herd, herbal,
upper and deeper). Adhering to the notation in Algorithm 2, we specify
the set of unique symbols for offset -1 as: $A \leftarrow \{h, p\}$. We first purge by h
(the '.' sign specifies a symbol that is ignored either because it belongs to a
purged string, or because the offset it resides in was already examined):

{ .[er]d ,
  .[er]bal ,
  ..... ,
  ...... ,
  [er]ror }

We are now left with the strings erd, erbal and error. We will now
recursively create a 'sub-mangled-trie' that would specifically resolve matches
for these strings. For the purged subtrie, assume the scoring offset is 2,
giving $A \leftarrow \{d, b, r\}$. Encountering any one of these symbols at offset 2 would
cause the elimination of all the words that do not contain it (e.g. b eliminates
erd and error). Encountering d successfully terminates the match without
requiring further validation, reporting herd. We therefore add a
Transitional node for d. b still requires extra substring matching, so we add a
Terminal for b, validating al at offset 3. Similarly, we add a Terminal
for r, validating or at offset 3. Note that if the input at offset 2 contains any
symbol other than d, b or r, we conclude with no match. Returning to the
root state, we now purge by p:

{ .... ,
  ...... ,
  up.[er] ,
  dee.[er] ,
  [er]ror }

We are left with the following string fragments: up at offset -3 for upper,
dee at offset -4 for deeper, and ror at offset 2 for error. We thus have two
disjoint subsets of words, each one on a different side of the motif match. If
the next scoring offset is chosen to the left (i.e. either -4, -3 or -2), then the
resulting subtrie would handle only the fragments on the left, and its 'leaves'
would be followed by a Pivot State for handling the resolving of the right
side. Alternatively, if the scoring offset is chosen on the right (i.e. either 2,
3 or 4) then a subtrie would handle the single remaining fragment on the
right, and would be followed by a Pivot State that would handle the left
side. Assume the scoring offset this time is -2. $A \leftarrow \{p, e\}$ for this offset,
and again each symbol choice uniquely identifies one of the two words on the
left. This allows us to add the corresponding Terminals (u at offset -3 if p
is found, and de at offset -4 if e is found) in order to complete the match.
But this time, no matter if we find a match on the left side or not, we also
have to match another Terminal on the right side (namely ror at offset 2).
This Terminal is added as a Pivot State. Finally, we return once more
to the root state at offset -1 and purge for the "fallback" case (i.e. neither h
nor p):

{ .... ,
  ...... ,
  ..... ,
  ...... ,
  [er]ror }

We are left with a single fragment: ror at offset 2 for matching error. We
need to add a Terminal for matching this fragment, but it is identical to
the Pivot State we added earlier, so the two nodes can be consolidated.
The complete mangled-trie is shown in Figure 1.

5 Choosing different scoring offsets would generate different mangled-tries, possibly
giving room for further optimization of memory consumption or mangled-trie depth.
Figure 1: Mangled-Trie for $R_{er}$ in Example 5.
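The recursive purge-by-symbol idea of Example 5 can be sketched as a simplified decision tree over (offset, symbol) pairs. This is not the mangled-trie itself: there are no Transitional, Terminal or Pivot nodes, every remaining symbol is checked one offset at a time, and words that have no symbol at the scoring offset are simply carried into every branch. The scoring rule (most shared offset first, then closest to the motif) and the names `build` and `resolve` are assumptions made for this illustration.

```python
from collections import Counter

def rel_offsets(word, l):
    """Offsets of word symbols relative to a motif at index l, skipping the motif (0 and 1)."""
    return [i - l for i in range(len(word)) if i - l not in (0, 1)]

def build(cands, examined=frozenset()):
    """Build a decision tree for candidates given as (word, motif offset) pairs."""
    pending = Counter(i for w, l in cands for i in rel_offsets(w, l) if i not in examined)
    if not pending:
        # Every symbol of every remaining candidate was verified on this path.
        return ("leaf", sorted(cands))
    # Hypothetical scoring rule: the offset shared by most candidates, ties by distance.
    off = min(pending, key=lambda i: (-pending[i], abs(i), i))
    done = examined | {off}
    inside = [(w, l) for w, l in cands if 0 <= off + l < len(w)]
    outside = [(w, l) for w, l in cands if not 0 <= off + l < len(w)]
    children = {}
    for s in {w[off + l] for w, l in inside}:
        keep = [(w, l) for w, l in inside if w[off + l] == s]   # purge by symbol s
        children[s] = build(outside + keep, done)
    # Fallback branch: only words with no symbol at this offset survive.
    return ("state", off, children, build(outside, done))

def resolve(tree, text, p):
    """Resolve a motif found at text[p:p+2]; returns (word, start position) pairs."""
    node = tree
    while node[0] == "state":
        _, off, children, fallback = node
        q = p + off
        node = children.get(text[q], fallback) if 0 <= q < len(text) else fallback
    return [(w, p - l) for w, l in node[1]]

# R_er for the minimum-motifs selection: (word, offset of "er" within the word).
tree = build([("herd", 1), ("herbal", 1), ("upper", 3), ("deeper", 4), ("error", 0)])
print(tree[1])                         # -1: the first scoring offset, as in Example 5
print(resolve(tree, "xxherdxx", 3))    # [('herd', 2)]
print(resolve(tree, "upperror", 3))    # [('error', 3), ('upper', 0)]
```

The last call shows one motif occurrence resolving to two overlapping words, one on each side of the motif, which is the situation that the Pivot State handles in the real structure.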



5. The Match Process 

The preprocessing stage constructs two separate data-structures: a map 
for searching motif occurrences and a set of mangled-tries for resolving motif 
matches. It is beneficial to separate motif finding from motif resolving, be- 
cause in practical applications this allows reuse of cached data. We adopt the 
notion of Fast-Path and Slow-Path common in networking (see [34]) to 
differentiate between the time-critical, deterministic process of motif finding 
and the input-sensitive process of motif resolving, respectively. Appendix B 
presents the match-time procedures. 
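The split between motif finding and motif resolving can be sketched as follows. This is an illustration under stated assumptions, not the procedures of Appendix B: the Slow-Path is replaced by a direct string comparison instead of a mangled-trie, the motif map is a plain dictionary, and the mapping `H` (word to even and odd motif offsets) is the minimum-motifs selection from Table 2. The `order` parameter is hypothetical and exists only to show that the traversal order does not affect the result.

```python
def build_motif_map(H):
    """Map each motif to the (word, motif offset) pairs hashed to it."""
    table = {}
    for w, offsets in H.items():
        for l in offsets:
            table.setdefault(w[l:l + 2], []).append((w, l))
    return table

def match(text, table, order=lambda positions: positions):
    """Scan non-overlapping 2-symbol strides in any order; return (start, word) matches."""
    hits = set()
    for p in order(list(range(0, len(text) - 1, 2))):
        for w, l in table.get(text[p:p + 2], ()):      # Fast-Path: one lookup per stride
            start = p - l
            # Slow-Path stand-in: verify the whole word around the motif position.
            if start >= 0 and text.startswith(w, start):
                hits.add((start, w))
    return hits

# word -> (even motif offset, odd motif offset), minimum-motifs selection.
H = {"herd": (0, 1), "herbal": (0, 1), "upper": (2, 3),
     "deeper": (4, 3), "error": (0, 1), "ferrarri": (2, 5)}
table = build_motif_map(H)

text = "deeper error: upper herd of herbal ferrarri"
forward = match(text, table)
backward = match(text, table, order=lambda ps: ps[::-1])
naive = {(s, w) for w in H for s in range(len(text)) if text.startswith(w, s)}
print(forward == backward == naive)   # True
```

A word starting at an even input position is caught through its even-offset motif, and a word starting at an odd position through its odd-offset motif, so scanning only even positions is sufficient and each occurrence is reported once.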

Theorem 2 (Bouma2 Memory Consumption). For a given language $L$
and a motif-set $M_L$, the amount of memory required for the Bouma2 match
structures is $O(4 \cdot |L| \cdot (|w_{max}| - 2) + |M_L|)$.

Proof. The Fast-Path procedure queries data based on the motif-set, hence
the $|M_L|$ term. For the Slow-Path, every word is represented by two separate paths
along one or two mangled-tries. By Theorem 1, such a path would consume
up to $O(2 \cdot (|w_{max}| - 2))$ memory, hence the result above. □
Theorem 3 (Bouma2 Worst-Case Complexity). Let the aggregate occurrence
probability of any of the motifs in $M_L$ be $P_{M_L} = \sum_{\mu \in M_L} P_\mu$. Then
the worst-case complexity of the Bouma2 match procedure over an input of
length $n$ is:

$$O(n \cdot (0.5 + P_{M_L} \cdot (|w_{max}| - 2))) . \tag{17}$$

Proof. During Fast-Path, $n/2$ comparisons are made over the input. Then
the number of motif matches is $P_{M_L} \cdot n/2$. In the worst case, the Slow-Path
would traverse the full depth of the mangled-tries for every motif occurrence,
so the number of Slow-Path operations according to Theorem 1 is:

$$(P_{M_L} \cdot n/2) \cdot (2 \cdot (|w_{max}| - 2)) . \tag{18}$$

The sum of the Fast-Path operations and the Slow-Path operations yields the
result above. □
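Written out, the sum in the proof is a one-line simplification. The numeric values in the second line are hypothetical and serve only to show the scale of the two terms.

$$\frac{n}{2} + \left(P_{M_L} \cdot \frac{n}{2}\right) \cdot 2 \cdot (|w_{max}| - 2) = n \cdot \left(0.5 + P_{M_L} \cdot (|w_{max}| - 2)\right)$$

$$\text{e.g. } P_{M_L} = 0.05,\ |w_{max}| = 10: \quad n \cdot (0.5 + 0.05 \cdot 8) = 0.9\, n$$

With these assumed values the Slow-Path term ($0.4\,n$) is already comparable to the Fast-Path term ($0.5\,n$), so halving $P_{M_L}$ through a better motif selection would lower the bound to $0.7\,n$.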

Theorem 3 demonstrates the importance of optimizing the motif-set qual- 
ity: for a given word-set, we can improve the worst-case match performance 
simply by selecting a better set of motifs. 

6. Experimental Results 

Snort®⁶ is a popular Open-Source Intrusion Prevention System maintained
by SourceFire, Inc. It is an excellent case-study for extensive use of
multiple exact string-match: Snort uses Aho-Corasick for initial fast filtering 
of cyber-attack signatures. Although the Aho-Corasick variant that is built 
into Snort was originally introduced as a performance optimization, it is in 
itself one of the largest bottlenecks of this software (see [6]). 

It is claimed in [35] that up to 70% of the Snort execution time is spent 
on various string-matching algorithms. In the experiments we conducted 
against Internet Service Provider packet captures, the default Snort Aho- 
Corasick variant accounted for most of the string-matching overhead, taking 
around 40% of the total execution time. 

Snort uses Aho-Corasick as follows: it builds several unique AC state- 
machines, based on the various TCP/UDP port groups specified in any of 



6 http://www.snort.org
its rules, and an extra one for rules that are not port-specific. This allows
initial port-based traffic filtering, and also probably reduces the total memory 
consumption and helps in localizing memory accesses compared with a single 
AC for all strings. On the other hand, this means that some strings may be 
accounted for in several distinct AC instances. Nevertheless, we test the 
AC against a single Bouma2 instance (which is inherently localized), with 
a single representation of each unique string (together with a reference to 
its duplicates). It should also be mentioned that some of the AC instances 
(around 3.5% of the entire set) are built to perform 2-symbol strides. We 
specify the number of AC instances built in each test. 
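
The difference between the two layouts can be illustrated with a small sketch. The rules, port-group names and variable names below are hypothetical and only show how the same string may be stored once per port group in the per-group layout, but only once (with references to its duplicates) in the single-instance layout.

```python
from collections import defaultdict

# Hypothetical rules: (rule id, port group, content string).
rules = [
    (1, "http", b"cmd.exe"),
    (2, "smtp", b"cmd.exe"),
    (3, "any", b"/etc/passwd"),
    (4, "http", b"/etc/passwd"),
]

# Per-port-group layout: one string set (one state-machine) per group.
per_group = defaultdict(set)
for rule_id, group, content in rules:
    per_group[group].add(content)

# Single-instance layout: each unique string is stored once,
# together with references to every rule that uses it.
single = defaultdict(list)
for rule_id, group, content in rules:
    single[content].append(rule_id)

stored_per_group = sum(len(strings) for strings in per_group.values())
stored_single = len(single)
print(stored_per_group, stored_single)
```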

We used the Snort v2.9.1.1 source code for Windows, together with the
v2.9.1.1 rule-set released on Oct. 6th, 2011. The Bouma2 version under test
accepts only strings of 3 bytes or more (see Section 8.6 for a discussion of
short-string support), so we identified and disabled all Snort rules that
contain 1-byte or 2-byte strings. We inserted calls to the Bouma2 API into
the Snort code, so that the 3 Bouma2 variants run on the same input that AC
receives. The Bouma2 preprocessor and matcher were written in C++ using
Microsoft® Visual Studio® 2010 Premium. The motif selection process was
implemented with source code from the COIN-OR [36] BCP [37] project⁷.

All tests were done on a Dell™ computer with Intel® Core™ 2 Duo CPU 
2.53 GHz with 1.95 GB RAM, running Windows XP SP3. In our tests 
we use the Microsoft® Visual Studio® 2010 Premium Sampling Profiler. We
conducted 5 different tests, using the default Snort rule-set and increasing the 
number of enabled rules with every test. As input we used a packet capture 
of traffic sampled at a large Internet Service Provider site. We applied the 
packets to Snort using the -r option. For the Rare Motifs in Input variant 
we used statistics gathered on one third of the entire capture. Match results 
were verified to be identical in a dry run. 
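
Gathering occurrence statistics from a portion of the capture could look roughly as follows. This is a sketch under stated assumptions: the function name is hypothetical, motifs are taken to be 2-byte windows, and probabilities are measured over even offsets only.

```python
from collections import Counter


def estimate_motif_probs(sample: bytes) -> dict:
    """Estimate the occurrence probability of each 2-byte window in a sample.

    Assumption: windows are read at even offsets only (a 2-symbol stride);
    a trailing odd byte is ignored.
    """
    counts = Counter(sample[i:i + 2] for i in range(0, len(sample) - 1, 2))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {motif: count / total for motif, count in counts.items()}


if __name__ == "__main__":
    capture = b"GET /index.html HTTP/1.1\r\nHost: example\r\n" * 30
    sample = capture[: len(capture) // 3]  # one third of the capture
    probs = estimate_motif_probs(sample)
    print(len(probs), "distinct windows seen")
```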

7 https://projects.coin-or.org/Bcp

We compare performance by means of Algorithm Throughput, taking
into account the processor frequency (2.53 GHz), the sample interval (every
10,000,000 clock cycles), the number of bytes sent to the match procedure
(around 1 GByte, actual value calculated during the test) and finally the
number of samples recorded by the profiler for each match procedure. We
use Equation 19:

\[
\begin{aligned}
\mathit{SampleInterval} &= \frac{10{,}000{,}000}{2.53 \cdot 1{,}000} = 3952.569\ \mu\text{sec} \\
\mathit{Throughput} &= \frac{\mathit{BytesConsumed} \cdot 8}{\mathit{SampleCount} \cdot \mathit{SampleInterval}}
 = 0.002024 \cdot \frac{\mathit{BytesConsumed}}{\mathit{SampleCount}}\ \text{Mbits/sec}
\end{aligned}
\tag{19}
\]
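
The arithmetic of Equation 19 can be checked with a few lines of code. This is only a sketch: the function and constant names are hypothetical, and the byte and sample counts in the usage example are made-up inputs, not measurements.

```python
CPU_GHZ = 2.53                       # processor frequency used in the tests
SAMPLE_INTERVAL_CYCLES = 10_000_000  # profiler sample interval, in clock cycles


def sample_interval_usec(cpu_ghz: float = CPU_GHZ,
                         cycles: int = SAMPLE_INTERVAL_CYCLES) -> float:
    """Sample interval in microseconds (cycles per microsecond = GHz * 1000)."""
    return cycles / (cpu_ghz * 1_000)


def throughput_mbits(bytes_consumed: int, sample_count: int) -> float:
    """Throughput in Mbits/sec: bits consumed divided by profiled microseconds."""
    return bytes_consumed * 8 / (sample_count * sample_interval_usec())


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    print(throughput_mbits(bytes_consumed=1_000_000_000, sample_count=1_000))
```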



We built 3 different versions of Bouma2, with 3 different cost functions 
(see Definition 8): 

1. Minimum Motifs: $c(t) = 1$

2. Rare Motifs in Strings⁸: $c(t) = \sum_{w \in L} \sum_{0 \le i < |w|} occ(w, i, t)$

3. Rare Motifs in Input (by occurrence probabilities): $c(t) = P(t)$
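
The three cost functions can be sketched as follows. This is an illustration under assumptions rather than the preprocessor's code: every 2-byte substring of a word is treated as a candidate motif, the function names are hypothetical, and the probability table is supplied by the caller.

```python
def motifs_of(word: bytes) -> list:
    """All 2-byte substrings of a word (assumed candidate motifs)."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def cost_minimum_motifs(t: bytes, words: list, prob: dict) -> float:
    """Every motif costs the same, so the solver minimizes the motif count."""
    return 1


def cost_rare_in_strings(t: bytes, words: list, prob: dict) -> float:
    """Number of occurrences of t over all words; rarer motifs are cheaper."""
    return sum(motifs_of(w).count(t) for w in words)


def cost_rare_in_input(t: bytes, words: list, prob: dict) -> float:
    """Occurrence probability of t in the expected input (0 if never seen)."""
    return prob.get(t, 0.0)


if __name__ == "__main__":
    words = [b"abcab", b"xab"]       # hypothetical word-set
    prob = {b"ab": 0.25}             # hypothetical input statistics
    for cost in (cost_minimum_motifs, cost_rare_in_strings, cost_rare_in_input):
        print(cost.__name__, cost(b"ab", words, prob))
```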

The results are presented in Tables 3, 4, 5, 6 and 7. Note the similarity 
between the occurrence probabilities calculated from the statistics that were 
gathered on one third of the traffic, and the actual numbers collected during 
the match. 



Test    Word    Unique   Traces   Words Agg.     Unique Words        Snort AC
No.     Count   Words             Size (bytes)   Agg. Size (bytes)   Instances
1       657     578      1,742    6,966          6,301               201

        Throughput     Memory       Motifs   Motif Occur.   Motif Occur.
        (MBits/sec)    (bytes)               Prob. (est.)   Prob. (actual)
AC      1,877.544543   2,550,000    -        -              -
B2-M    2,972.778859   524,800      254      0.0241563      0.028288445
B2-RS   3,086.110605   539,392      354      0.0221357      0.024200245
B2-RI   3,513.735279   525,312      309      0.0190495      0.020905283

Table 3: Snort Benchmark (657 strings)



The throughput comparison is visualized in Figure 2. It is evident that 
Bouma2 achieves around double the throughput compared with AC. More- 
over, choosing the Bouma2 version that is suitable for the input may improve 



8 In the 5th test, the BCP algorithm failed to find a solution for the
Rare-Motifs-in-Strings variant in reasonable time.






Test    Word    Unique   Traces   Words Agg.     Unique Words        Snort AC
No.     Count   Words             Size (bytes)   Agg. Size (bytes)   Instances
2       1,751   1,290    3,655    24,062         17,349              285

        Throughput     Memory       Motifs   Motif Occur.   Motif Occur.
        (MBits/sec)    (bytes)               Prob. (est.)   Prob. (actual)
AC      1,594.655819   9,090,000    -        -              -
B2-M    2,321.656187   1,168,896    396      0.0383088      0.037200262
B2-RS   2,064.91349    1,138,432    684      0.0361115      0.035557825
B2-RI   2,543.281109   1,181,952    487      0.0304228      0.030030249

Table 4: Snort Benchmark (1,751 strings)



Test    Word    Unique   Traces   Words Agg.     Unique Words        Snort AC
No.     Count   Words             Size (bytes)   Agg. Size (bytes)   Instances
3       2,443   1,609    4,041    31,843         22,275              336

        Throughput     Memory       Motifs   Motif Occur.   Motif Occur.
        (MBits/sec)    (bytes)               Prob. (est.)   Prob. (actual)
AC      1,433.464052   11,480,000   -        -              -
B2-M    1,855.21453    1,531,392    460      0.0405084      0.039606471
B2-RS   2,186.104496   1,497,088    761      0.0390466      0.038530029
B2-RI   2,239.971548   1,507,328    547      0.0327172      0.032419297

Table 5: Snort Benchmark (2,443 strings)



Test    Word    Unique   Traces   Words Agg.     Unique Words        Snort AC
No.     Count   Words             Size (bytes)   Agg. Size (bytes)   Instances
4       4,949   3,296    7,795    77,282         54,475              414

        Throughput     Memory       Motifs   Motif Occur.   Motif Occur.
        (MBits/sec)    (bytes)               Prob. (est.)   Prob. (actual)
AC      1,052.746683   28,600,000   -        -              -
B2-M    1,679.054087   3,259,392    705      0.0469979      0.045910587
B2-RS   1,697.318856   3,280,640    1,058    0.047068       0.046318193
B2-RI   1,763.07455    3,266,048    814      0.0410833      0.040417134

Table 6: Snort Benchmark (4,949 strings)



Test    Word    Unique   Traces   Words Agg.     Unique Words        Snort AC
No.     Count   Words             Size (bytes)   Agg. Size (bytes)   Instances
5       7,146   4,841    8,789    131,547        98,546              424

        Throughput     Memory       Motifs   Motif Occur.   Motif Occur.
        (MBits/sec)    (bytes)               Prob. (est.)   Prob. (actual)
AC      841.9984458    51,370,000   -        -              -
B2-M    1,498.041131   4,859,136    862      0.0520728      0.050279535
B2-RS   -              -            -        -              -
B2-RI   1,697.08156    4,861,184    985      0.0455084      0.044300929

Table 7: Snort Benchmark (7,146 strings)



the throughput: preferring motifs that are rare in the input achieves a 13%
improvement compared with the minimum-motifs version. A comparison
of the memory consumption is displayed in Figure 3. Evidently, Bouma2
requires roughly 5 to 10 times less memory than the Snort version of AC,
with the largest saving on the largest rule-set.
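
The ratios behind these statements can be recomputed directly from Tables 3 to 7. The snippet below only re-derives figures from the reported numbers; the variable and function names are hypothetical.

```python
# test no. -> (AC throughput, B2-M throughput, B2-RI throughput,
#              AC memory in bytes, B2-RI memory in bytes), from Tables 3-7.
results = {
    1: (1877.544543, 2972.778859, 3513.735279, 2_550_000, 525_312),
    2: (1594.655819, 2321.656187, 2543.281109, 9_090_000, 1_181_952),
    3: (1433.464052, 1855.21453, 2239.971548, 11_480_000, 1_507_328),
    4: (1052.746683, 1679.054087, 1763.07455, 28_600_000, 3_266_048),
    5: (841.9984458, 1498.041131, 1697.08156, 51_370_000, 4_861_184),
}


def ratios(row):
    """Return (B2-RI vs AC speed-up, B2-RI vs B2-M speed-up, AC/B2-RI memory ratio)."""
    ac, b2m, b2ri, ac_mem, b2ri_mem = row
    return b2ri / ac, b2ri / b2m, ac_mem / b2ri_mem


if __name__ == "__main__":
    for test_no, row in results.items():
        speedup, vs_min, mem = ratios(row)
        print(f"test {test_no}: x{speedup:.2f} vs AC, "
              f"x{vs_min:.2f} vs B2-M, AC uses x{mem:.1f} memory")
```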

7. Conclusion 

In this paper we presented Bouma2, which is to our knowledge the first
Consume-Order Agnostic multiple exact string-match algorithm. This
approach allows independent (and therefore parallelizable) examination of
different segments of the input, and as such corresponds well with modern
processor and other hardware architectures, which allow fast random access
to a fairly large amount of cached memory. This is opposed to the
Consume-Order Dependent State-Machine model, which assumes access to a single
symbol at a time, and as such corresponds better with the theoretical
single-tape Turing-Machine [38] concept.

One lesson that we learn from the tests in Section 6 may seem obvious:
real-life data is not random. We can (and should) rely on prior
knowledge of data characteristics when applying operations on new data.
Different types of data have different characteristics, and an adaptive version
of Bouma2 will be able to learn these characteristics and improve its choice
of motifs on-the-fly.

We believe that regarding the problem of improving pattern-match
performance as a linear optimization problem sheds new light on this
well-researched area. The Branch-and-Cut algorithm has proven itself in our case
as a powerful and flexible tool that should be explored and used more.
Taking Bouma2 as a case study and applying the same approach to other areas
of study may also prove beneficial. As an example we give the work in [39],
which tries to address very recent needs of Deep-Packet-Inspection over
compressed input by giving a solution that is tightly coupled with the
Aho-Corasick scheme. An alternative Consume-Order Agnostic solution applying
linear optimization for both compression and DPI may also be considered.

Figure 2: Throughput comparison for tests in Section 6. (Throughput in
Mbit/sec versus total string size in bytes, for AC, B2-M, B2-RS and B2-RI.)



Figure 3: Memory comparison for tests in Section 6. (Memory consumption in
bytes versus total string size in bytes.)

8. Future Work

8.1. Pandemonium

Pandemonium⁹ is a regular expression matching library, which is powered
by Bouma2 and follows similar concepts. As with exact string-match,
regular expression match can also be Consume-Order Agnostic: in essence,
a regular expression is a set of queries over the input text; ALL the queries
have to yield a positive result for the match to succeed, regardless of the
order in which they are performed. In many cases, matches can be abandoned
at an early stage based on negative query results, which may have been
obtained by efficiently leveraging match results from Bouma2. Pandemonium
can utilize Bouma2's Fast-Path match results to perform advanced matches
(e.g. case-insensitive match of specific words). Another powerful feature is
that Pandemonium can accept multiple regular expressions and match them
in parallel.

9 The term was chosen as a tribute to Selfridge's cognitive Pandemonium Model [40].
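
The idea of treating a regular expression as a set of order-independent queries, with early abandonment on a negative result, can be illustrated as follows. This sketch is not the Pandemonium API: the function name is hypothetical, plain substring tests stand in for Bouma2 match results, and Python's `re` module stands in for the full expression match.

```python
import re


def match_with_prefilter(text: str, required_literals: list, pattern: str):
    """Run cheap literal queries first; run the full expression only if all succeed."""
    for literal in required_literals:
        if literal not in text:
            return None  # a negative query result: abandon the match early
    return re.search(pattern, text)


if __name__ == "__main__":
    pattern = r"GET\s+\S+\.php\?id=\d+"
    literals = ["GET", ".php"]  # literals every match of the pattern must contain
    print(bool(match_with_prefilter("GET /index.php?id=1", literals, pattern)))
    print(bool(match_with_prefilter("POST /upload", literals, pattern)))
```

The literal queries can be evaluated in any order, which is what makes the approach independent of consume order.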

8.2. Performance Improvements 

It is difficult to estimate the nominal impact of improving the motif-set
on performance. Since the actual Slow-Path performance depends on the
specific mangled-tries that are accessed, the weight applied when selecting
a motif should also be affected by the relative complexity of the resulting
mangled-trie. We are thus researching methods of improving the motif cost
functions in this manner (e.g. just like trace occurrence counts, motif-specific
performance data for evaluating mangled-trie costs can also be collected
on-the-fly).

8.3. Complexity

It would be beneficial to refine the complexity result obtained in The- 
orem 3, and specifically find bounds that do not rely on a given motif-set 
selection, but rather on the diversity of trace values in the word-set. Also, 
finding bounds for the best-case complexity and its relationship with the 
motif occurrence probability may allow us to estimate a desired occurrence 
probability value and apply it as a target value to the motif-set selection 
process. 

8.4. Algorithmic Attacks

It is claimed that the Aho-Corasick algorithm is less prone to algorithmic 
complexity attacks because of its deterministic performance [10]. The same 
can be said of the Bouma2 Fast-Path algorithm, which acts as "the first line 
of defense" against such attacks. Nevertheless, while the choice of motifs 
can be optimized for a specific input, an attacker may generate malicious 
input containing "well-known" motifs that will trigger many false-positives. 
Solutions for this problem are being investigated, including throttling Slow- 
Path matches according to motif occurrence rates. 

8.5. Applying Statistics 

Currently, occurrence statistics are applied only in the first preprocessing
stage (see Section 4.1). We believe it would be beneficial also to perform
duplicates-removal and implement the B2-CALC-SCORING-OFFSET() method
in Algorithm 2 using occurrence statistics, but obviously the benefits require
further research. Having said that, tuning the Branch-and-Cut algorithm
performance is essential: the time required to arrive at a solution is not
deterministic, and sometimes exceeds 30 minutes with no solution¹⁰ for certain
sets of statistics. We hope to find an optimization, given that the ILP we
describe here seems far simpler than the kind of problems that the
general-purpose Branch-and-Cut algorithm was designed to solve.



10 This happened on our 5th test; see Section 6.

8.6. Short Strings

Currently, Bouma2 supports strings 3 symbols long and above. Nevertheless,
it has been claimed (see [41]) that one of the Aho-Corasick algorithm's
appealing features is its ability to handle short strings well, as opposed to
existing alternatives. One proposed solution for Bouma2 is to map single
symbols to up to $2 \cdot |\Sigma|$ motifs ($|\Sigma|$ motifs for even offsets
and $|\Sigma|$ motifs for odd offsets), and similarly map 2-symbol strings to
up to $1 + |\Sigma|$ motifs (1 motif for even offsets and $|\Sigma|$ motifs for
odd offsets, essentially expanding to 3-symbol strings). The effect on memory
and performance of this enhancement should be examined.
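
The motif counts quoted above can be reproduced with a short enumeration. This is a sketch of one possible reading of the proposed mapping, assuming 8-bit symbols and 2-symbol motifs read at even input offsets; the function names are hypothetical.

```python
SIGMA = 256  # alphabet size for 8-bit symbols


def motifs_for_single_symbol(s: int):
    """Motifs that cover a 1-symbol string, split by the parity of its offset."""
    even = {bytes([s, x]) for x in range(SIGMA)}  # s is the first motif symbol
    odd = {bytes([x, s]) for x in range(SIGMA)}   # s is the second motif symbol
    return even, odd


def motifs_for_two_symbols(a: int, b: int):
    """Motifs that cover a 2-symbol string, split by the parity of its offset."""
    even = {bytes([a, b])}                        # the string is itself a motif
    odd = {bytes([x, a]) for x in range(SIGMA)}   # 3-symbol expansion x, a, b
    return even, odd


if __name__ == "__main__":
    even, odd = motifs_for_single_symbol(ord("A"))
    print(len(even) + len(odd))   # at most 2 * SIGMA
    even, odd = motifs_for_two_symbols(ord("A"), ord("B"))
    print(len(even) + len(odd))   # at most 1 + SIGMA
```

The even and odd sets may share a motif (for example the doubled symbol), which is why the counts are upper bounds.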

8.7. Bouma3 and Beyond

Setting the motif width to 2 symbols is very convenient from a technical
point of view when dealing with 8-bit symbols ($|\Sigma| = 256$), as in the case
of Internet traffic inspection or file contents inspection. When we consider
Computational Biology, the problem-space dictates $|\Sigma| = 4$. Determining
the motif width in this case should be done based on the trade-off between
memory and performance constraints on one hand (wider motifs require more
Fast-Path memory and would decrease Fast-Path performance because of
cache-misses) and motif uniqueness on the other hand. Obviously, wider
motifs require more than 2 mappings per word, and also more complicated
short-string special case handling.
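
The memory side of this trade-off can be quantified with simple arithmetic. The sketch below assumes, purely for illustration, that the Fast-Path structure holds one entry per possible motif value; the function name is hypothetical.

```python
def fast_path_entries(alphabet_size: int, motif_width: int) -> int:
    """Number of distinct motif values for a given alphabet and motif width."""
    return alphabet_size ** motif_width


if __name__ == "__main__":
    print("bytes, width 2:", fast_path_entries(256, 2))
    print("bytes, width 3:", fast_path_entries(256, 3))
    for width in (2, 4, 8, 12):
        print(f"DNA, width {width}:", fast_path_entries(4, width))
```

Under this assumption, a 4-symbol alphabet reaches the same number of distinct motif values as 2-byte motifs only at a width of 8 symbols.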
Appendix A. Mangled-Trie Construction Algorithm

Algorithms 1, 2, 3 and 4 detail the construction of a mangled-trie from
a given resolve-set, as described in Section 4.3. Note that the scoring offset
calculation procedure and the actual mangled-trie construction procedures
are not shown here because they are implementation-specific. We denote the
symbol at offset i relative to the motif position in a word w as a(w, i).



Algorithm 1: B2-BUILD-MANGLED-TRIE
Input: R_μ, S_μ
Output: MT_μ
begin
    Initialize with complete set of offset-symbol pairs:
    MT_μ ← B2-BUILD-SUBTRIE(R_μ, S_μ)
    Optimize memory by removing duplicate nodes:
    B2-CONSOLIDATE-NODES(MT_μ)



Algorithm 2: B2-BUILD-SUBTRIE
Input: R', S
Output: MT
begin
    MT ← ∅
    if |R'| > 1 then
        Heuristic for determining best offset for resolving:
        i^scoring ← B2-CALC-SCORING-OFFSET(S)
        A ← {a(w, i^scoring) : w ∈ R'} ∪ {ε}
        foreach a^consumed ∈ A do
            (W, S', w^trans) ← B2-PURGE-OFFSET(S, i^scoring, a^consumed)
            if w^trans ≠ ε then
                A transitional match occurred while consuming:
                B2-ADD-TRANSITIONAL(MT, w^trans)
            (W^pivot, S^pivot) ← B2-FIND-PIVOT(S', i^scoring)
            if |W^pivot| = 0 then
                MT' ← B2-BUILD-SUBTRIE(W, S')
                B2-ADD-SUBTRIE(MT, i^scoring, a^consumed, MT')
            else
                The Pivot recursively branches along 2 sides of motif:
                MT' ← B2-BUILD-SUBTRIE(W \ W^pivot, S' \ S^pivot)
                B2-ADD-SUBTRIE(MT, i^scoring, a^consumed, MT')
                MT' ← B2-BUILD-SUBTRIE(W^pivot, S^pivot)
                B2-ADD-PIVOT-SUBTRIE(MT, i^scoring, MT')
    else
        B2-ADD-TERMINAL(MT, W)



Algorithm 3: B2-PURGE-OFFSET
Input: S, i^scoring, a^consumed
Output: (W, S', w^trans)
begin
    w^trans ← ε
    All words with symbol mismatch at scoring offset:
    W̄ ← {w : a(w, i^scoring) ∉ {a^consumed, ε}}
    Inverse of W̄:
    W ← R' \ W̄
    Look for transitional:
    if ∃w' ∈ W : a(w', i') = ε, ∀i' ≠ i^scoring then
        w^trans ← w'
    Discard all symbols at scoring offset:
    S' ← S \ {(w, i^scoring) : w ∈ W}
    Discard all words with symbol mismatch at scoring offset:
    S' ← S' \ {(w, i) : (w, i) ∈ S', w ∈ W̄}



Algorithm 4: B2-FIND-PIVOT
Input: S, i^scoring
Output: (W^pivot, S^pivot)
begin
    S^positive ← {(w, i) : (w, i) ∈ S, i ≥ 2}
    W^positive ← {w : (w, i) ∈ S^positive}
    S^negative ← {(w, i) : (w, i) ∈ S, i ≤ -1}
    W^negative ← {w : (w, i) ∈ S^negative}
    if |W^positive ∩ W^negative| = 0 then
        if i^scoring ≥ 2 then
            W^pivot ← W^negative
            S^pivot ← S^negative
        else
            W^pivot ← W^positive
            S^pivot ← S^positive
Algorithm 5: B2-MATCH-PROC
Input: M_L, ⋃_{μ ∈ M_L} {MT_μ}, W_I ∈ Σ*
Output: MATCHES
begin
    HARVEST ← B2-FAST-PATH(M_L, W_I)
    MATCHES ← B2-SLOW-PATH(HARVEST, ⋃_{μ ∈ M_L} {MT_μ}, W_I)



Algorithm 6: B2-FAST-PATH
Input: M_L, W_I ∈ Σ*
Output: HARVEST
begin
    HARVEST ← ∅
    foreach i ∈ {x : 0 ≤ x < |W_I| ∧ x % 2 = 0} do
        if W_I = W_I|_p t W_I|_s : |W_I|_p| = i ∧ t ∈ M_L then
            HARVEST ← HARVEST ∪ {(i, t)}
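
The Fast-Path scan of Algorithm 6 can be sketched in a few lines. This is a minimal illustration, assuming 2-byte motifs kept in a hash set; the function name and the use of a Python set are assumptions and do not reflect the data-structure of the actual Bouma2 matcher.

```python
def fast_path(motifs: set, data: bytes) -> list:
    """Collect (offset, motif) pairs for every even offset whose 2-byte window is a motif."""
    harvest = []
    # Stride of 2 symbols; a trailing single byte cannot hold a motif and is skipped.
    for i in range(0, len(data) - 1, 2):
        window = data[i:i + 2]
        if window in motifs:
            harvest.append((i, window))
    return harvest


if __name__ == "__main__":
    motifs = {b"ab"}
    print(fast_path(motifs, b"xxabab"))  # motif aligned on even offsets
    print(fast_path(motifs, b"xabx"))    # same motif on an odd offset is not seen
```

Because only even offsets are inspected, a word that starts at an odd input offset has to be reachable through a different motif, which is consistent with each word being mapped to more than one motif.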



Appendix B. Match-Process Algorithm 

Algorithms 5, 6, 7 and 8 describe the Bouma2 match-process, as explained
in Section 5. Implementation-specific procedures for matching symbols and
substrings are omitted.

Algorithm 7: B2-SLOW-PATH
Input: HARVEST, ⋃_{μ ∈ M_L} {MT_μ}, W_I ∈ Σ*
Output: MATCHES
begin
    MATCHES ← ∅
    foreach (i, μ) ∈ HARVEST do
        MT_μ ← B2-MT-NEXT-TRANSITION(MT_μ, i)
        (MATCHES, MT_μ^pivot) ← B2-SP-LOOP(MT_μ, i, MATCHES)
        if MT_μ^pivot ≠ ∅ then
            (MATCHES, MT_μ^pivot) ← B2-SP-LOOP(MT_μ^pivot, i, MATCHES)



Algorithm 8: B2-SP-LOOP
Input: MT_μ, i, MATCHES
Output: (MATCHES, MT^pivot)
begin
    while MT_μ ≠ ∅ do
        if B2-IS-TERMINAL(MT_μ) then
            MATCHES ← MATCHES ∪ B2-MATCH-TERMINAL(MT_μ, i)
        else if B2-IS-TRANSITIONAL(MT_μ) then
            MATCHES ← MATCHES ∪ B2-MATCH-TRANSITIONAL(MT_μ, i)
        else if B2-IS-PIVOT(MT_μ) then
            MT^pivot ← B2-GET-PIVOT(MT_μ)
        MT_μ ← B2-MT-NEXT-TRANSITION(MT_μ, i)